
    Representing and analysing molecular and cellular function in the computer

    Determining the biological function of a myriad of genes, and understanding how they interact to yield a living cell, is the major challenge of the post-genome-sequencing era. The complexity of biological systems is such that this cannot be envisaged without the help of powerful computer systems capable of representing and analysing the intricate networks of physical and functional interactions between the different cellular components. In this review we try to provide the reader with an appreciation of where we stand in this regard. We discuss some of the inherent problems in describing the different facets of biological function, give an overview of how information on function is currently represented in the major biological databases, and describe different systems for organising and categorising the functions of gene products. In the second part, we present a new general data model, currently under development, which describes information on molecular function and cellular processes in a rigorous manner. The model is capable of representing a large variety of biochemical processes, including metabolic pathways, regulation of gene expression and signal transduction. It also incorporates taxonomies for categorising molecular entities, interactions and processes, and it offers means of viewing the information at different levels of resolution and of dealing with incomplete knowledge. The data model has been implemented in 'aMAZE' (http://www.ebi.ac.uk/research/pfbp/), a database on protein function and cellular processes, which presently covers metabolic pathways and their regulation. Several tools for querying, displaying, and performing analyses on such pathways are briefly described in order to illustrate the practical applications enabled by the model.
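The kind of data model sketched in the abstract can be illustrated with a toy graph of entities and interactions. This is a hypothetical sketch in the same spirit, not the actual aMAZE schema: molecular entities and the processes connecting them are represented uniformly, so metabolic, regulatory and signalling steps all fit the same structure.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str            # e.g. "compound", "polypeptide", "gene"

@dataclass
class Interaction:
    label: str           # e.g. "reaction", "expression", "inhibition"
    inputs: list
    outputs: list

@dataclass
class Pathway:
    name: str
    steps: list = field(default_factory=list)

    def entities(self):
        # Collect each distinct entity appearing in any step, in order of first use.
        seen = {}
        for step in self.steps:
            for e in step.inputs + step.outputs:
                seen[e.name] = e
        return list(seen.values())

glc = Entity("glucose", "compound")
g6p = Entity("glucose-6-phosphate", "compound")
hk  = Entity("hexokinase", "polypeptide")
step = Interaction("reaction", inputs=[glc, hk], outputs=[g6p, hk])
glycolysis = Pathway("glycolysis (fragment)", [step])
print([e.name for e in glycolysis.entities()])
# ['glucose', 'hexokinase', 'glucose-6-phosphate']
```

Because interactions are first-class objects rather than plain edges, a regulatory step can itself be the target of another interaction, which is what makes representing regulation of gene expression straightforward.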

    A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context.

    An important step in understanding the regulation of a prokaryotic genome is the generation of its transcription unit map. The strongest current operon predictor depends on the distributions of intergenic distances (IGD) separating adjacent genes within and between operons. Unfortunately, experimental data on these distance distributions are limited to Escherichia coli and Bacillus subtilis. We suggest a new graph-algorithmic approach based on comparative genomics to identify clusters of conserved genes independent of IGD and conservation of gene order. As a consequence, distance distributions of operon pairs for any arbitrary prokaryotic genome can be inferred. For E.coli, the algorithm predicts 854 conserved adjacent pairs with a precision of 85%. The IGD distribution for these pairs is virtually identical to the E.coli operon pair distribution. Statistical analysis of the predicted pair IGD distribution allows estimation of a genome-specific operon IGD cut-off, obviating the requirement for a training set in IGD-based operon prediction. We apply the method to a representative set of eight genomes, and show that these genome-specific IGD distributions differ considerably from each other and from the distribution in E.coli.
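The distance computation at the heart of IGD-based prediction is simple to state. The sketch below is illustrative only (the function names and the flat cutoff are my own, not the paper's statistically derived genome-specific cut-off): given sorted same-strand gene coordinates, it computes the intergenic distance of each adjacent pair and calls pairs below a cutoff co-operonic.

```python
def intergenic_distances(genes):
    """genes: list of (start, end) tuples, sorted by start, same strand."""
    return [genes[i + 1][0] - genes[i][1] - 1 for i in range(len(genes) - 1)]

def call_operon_pairs(genes, cutoff):
    """Return indices i where adjacent genes i and i+1 are predicted co-operonic."""
    return [i for i, d in enumerate(intergenic_distances(genes)) if d <= cutoff]

genes = [(100, 400), (410, 900), (1500, 2000), (2015, 2500)]
print(intergenic_distances(genes))   # [9, 599, 14]
print(call_operon_pairs(genes, 50))  # [0, 2]
```

The paper's contribution is precisely that the cutoff need not be guessed or trained on E.coli: it is estimated from the IGD distribution of conserved adjacent pairs found in the target genome itself.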

    Estimating translational selection in eukaryotic genomes

    Natural selection on codon usage is a pervasive force that acts on a large variety of prokaryotic and eukaryotic genomes. Despite this, obtaining reliable estimates of selection on codon usage has proved complicated, perhaps due to the fact that the selection coefficients involved are very small. In this work, a population genetics model is used to measure the strength of selected codon usage bias, S, in 10 eukaryotic genomes. It is shown that the strength of selection is closely linked to expression and that reliable estimates of selection coefficients can only be obtained for genes with very similar expression levels. We compare the strength of selected codon usage for orthologous genes across all 10 genomes classified according to expression categories. Fungal genomes present the largest S values (2.24–2.56), whereas multicellular invertebrate and plant genomes present more moderate values (0.61–1.91). The large mammalian genomes (human and mouse) show low S values (0.22–0.51) for the most highly expressed genes. This might not be evidence for selection in these organisms, as the technique used here to estimate S does not properly account for nucleotide composition heterogeneity along such genomes. The relationship between estimated S values and empirical estimates of population size is presented here for the first time. It is shown, as theoretically expected, that population size plays an important role in the efficacy of translational selection.
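The intuition behind an S estimate can be shown with a deliberately simplified case; the paper's model is more elaborate, so treat this as a sketch under stated assumptions. For a two-codon synonymous family under mutation-selection-drift with symmetric mutation, the equilibrium frequency p of the preferred codon satisfies p = e^S / (1 + e^S), so S is the log-odds of p.

```python
import math

def estimate_S(preferred_count, unpreferred_count):
    """Log-odds estimator of S for a two-codon family (symmetric mutation assumed)."""
    p = preferred_count / (preferred_count + unpreferred_count)
    return math.log(p / (1 - p))

# e.g. a highly expressed gene using the preferred codon 80% of the time
print(round(estimate_S(80, 20), 3))  # 1.386, i.e. ln(4)
```

This also makes the abstract's caveat concrete: counts of "preferred" codons are distorted by background nucleotide composition, so in compositionally heterogeneous genomes such as human and mouse the log-odds conflates mutation bias with selection.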

    Quantification of global transcription patterns in prokaryotes using spotted microarrays

    We describe an analysis, applicable to any spotted microarray dataset produced using genomic DNA as a reference, that quantifies prokaryotic levels of mRNA on a genome-wide scale. Applying this to Mycobacterium tuberculosis, we validate the technique, show a correlation between level of expression and biological importance, define the complement of invariant genes and analyse absolute levels of expression by functional class to develop ways of understanding an organism's biology without comparison to another growth condition.
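The reason genomic DNA works as a universal reference is that every gene is present in it at essentially equal copy number, so the cDNA/gDNA signal ratio per spot is proportional to that gene's mRNA abundance and is comparable across genes. A minimal sketch (variable names are illustrative, not from the paper):

```python
def mrna_levels(cdna_signal, gdna_signal):
    """Per-gene expression estimate: cDNA channel over gDNA reference channel."""
    return {g: cdna_signal[g] / gdna_signal[g] for g in cdna_signal}

cdna = {"rpoB": 5200.0, "katG": 1300.0}
gdna = {"rpoB": 1000.0, "katG": 1000.0}
print(mrna_levels(cdna, gdna))  # {'rpoB': 5.2, 'katG': 1.3}
```

Because the denominator is condition-independent, levels estimated this way can be interpreted on their own, which is what allows analysis "without comparison to another growth condition".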

    An Exact Algorithm for Side-Chain Placement in Protein Design

    Computational protein design aims at constructing novel or improved functions on the structure of a given protein backbone and has important applications in the pharmaceutical and biotechnical industry. The underlying combinatorial side-chain placement problem consists of choosing a side-chain placement for each residue position such that the resulting overall energy is minimum. The choice of the side-chain then also determines the amino acid for this position. Many algorithms for this NP-hard problem have been proposed in the context of homology modeling, which, however, reach their limits when faced with large protein design instances. In this paper, we propose a new exact method for the side-chain placement problem that works well even for large instance sizes as they appear in protein design. Our main contribution is a dedicated branch-and-bound algorithm that combines tight upper and lower bounds resulting from a novel Lagrangian relaxation approach for side-chain placement. Our experimental results show that our method outperforms alternative state-of-the-art exact approaches and makes it possible to optimally solve large protein design instances routinely.
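The combinatorial problem itself is easy to state in code. The sketch below solves a toy instance by exhaustive enumeration; this is NOT the paper's branch-and-bound/Lagrangian method (whose whole point is avoiding enumeration on large instances), just an illustration of the objective being minimised: one rotamer per position, with self energies plus pairwise interaction energies.

```python
import itertools

def best_placement(self_E, pair_E):
    """self_E[i][r]: energy of rotamer r at position i.
       pair_E[(i, j)][(r, s)]: interaction energy of rotamers r@i and s@j."""
    n = len(self_E)
    best, best_choice = float("inf"), None
    # Enumerate every rotamer assignment (exponential; fine only for toy sizes).
    for choice in itertools.product(*(range(len(e)) for e in self_E)):
        E = sum(self_E[i][choice[i]] for i in range(n))
        E += sum(pair_E[(i, j)][(choice[i], choice[j])]
                 for i in range(n) for j in range(i + 1, n))
        if E < best:
            best, best_choice = E, choice
    return best_choice, best

self_E = [[0.0, 1.0], [0.5, 0.25]]
pair_E = {(0, 1): {(0, 0): 2.0, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 3.0}}
print(best_placement(self_E, pair_E))  # ((0, 1), 0.5)
```

An exact method must return the same minimum as this enumeration; the contribution of the paper is reaching it with bounds that prune almost all of the search space.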

    Imputation of ordinal outcomes: a comparison of approaches in Traumatic Brain Injury

    Loss to follow-up and missing outcomes data are important issues for longitudinal observational studies and clinical trials in traumatic brain injury. One popular solution to missing 6-month outcomes has been to use the last observation carried forward (LOCF). The purpose of the current study was to compare the performance of model-based single-imputation methods with that of the LOCF approach. We hypothesized that model-based methods would perform better as they potentially make better use of available outcome data. The Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injury (CENTER-TBI) study (n = 4509) included longitudinal outcome collection at 2 weeks, 3 months, 6 months, and 12 months post-injury; a total of 8185 Glasgow Outcome Scale extended (GOSe) observations were included in the database. We compared single imputation of 6-month outcomes using LOCF, a multiple imputation (MI) panel imputation, a mixed-effect model, a Gaussian process regression, and a multi-state model. Model performance was assessed via cross-validation on the subset of individuals with a valid GOSe value within 180 +/- 14 days post-injury (n = 1083). All models were fit on the entire available data after removing the 180 +/- 14 days post-injury observations from the respective test fold. The LOCF method showed lower accuracy (i.e., poorer agreement between imputed and observed values) than model-based methods of imputation, and showed a strong negative bias (i.e., it imputed lower than observed outcomes). Accuracy and bias for the model-based approaches were similar to one another, with the multi-state model having the best overall performance. All methods of imputation showed variation across different outcome categories, with better performance for more frequent outcomes. 
We conclude that model-based methods of single imputation have substantial performance advantages over LOCF, in addition to providing more complete outcome data.
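The baseline the study argues against is simple enough to sketch directly. A minimal LOCF implementation (the model-based alternatives compared in the study are substantially more involved; names and the data layout here are illustrative):

```python
def locf(observations, target_time):
    """observations: {time_in_months: GOSe or None}.
    Return the latest non-missing value at or before target_time, else None."""
    earlier = [t for t, v in observations.items()
               if t <= target_time and v is not None]
    return observations[max(earlier)] if earlier else None

gose = {0.5: 3, 3: 5, 12: 7}   # 6-month visit missed
print(locf(gose, 6))           # 5  (carried forward from the 3-month visit)
```

The example also shows where LOCF's negative bias comes from: patients often improve over time, so a value carried forward from an earlier visit (here 5, versus the later observed 7) tends to understate the 6-month outcome.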

    The combination of autofluorescence endoscopy and molecular biomarkers is a novel diagnostic tool for dysplasia in Barrett's oesophagus.

    OBJECTIVE: Endoscopic surveillance for Barrett's oesophagus (BO) is limited by sampling error and the subjectivity of diagnosing dysplasia. We aimed to compare a biomarker panel on minimal biopsies directed by autofluorescence imaging (AFI) with the standard surveillance protocol to derive an objective tool for dysplasia assessment. DESIGN: We performed a cross-sectional prospective study in three tertiary referral centres. Patients with BO underwent high-resolution endoscopy followed by AFI-targeted biopsies. 157 patients completed the biopsy protocol. Aneuploidy/tetraploidy; 9p and 17p loss of heterozygosity; RUNX3, HPP1 and p16 methylation; p53 and cyclin A immunohistochemistry were assessed. Bootstrap resampling was used to select the best diagnostic biomarker panel for high-grade dysplasia (HGD) and early cancer (EC). This panel was validated in an independent cohort of 46 patients. RESULTS: Aneuploidy, p53 immunohistochemistry and cyclin A had the strongest association with dysplasia in the per-biopsy analysis and, as a panel, had an area under the receiver operating characteristic curve of 0.97 (95% CI 0.95 to 0.99) for diagnosing HGD/EC. The diagnostic accuracy for HGD/EC of the three-biomarker panel from AFI+ areas was superior to AFI- areas (p<0.001). Compared with the standard protocol, this panel had equal sensitivity for HGD/EC, with a 4.5-fold reduction in the number of biopsies. In an independent cohort of patients, the panel had a sensitivity and specificity for HGD/EC of 100% and 85%, respectively. CONCLUSIONS: A three-biomarker panel on a small number of AFI-targeted biopsies provides an accurate and objective diagnosis of dysplasia in BO. The clinical implications have to be studied further.

    Factor analysis for gene regulatory networks and transcription factor activity profiles

    BACKGROUND: Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes. RESULTS: In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result. 
CONCLUSION: Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
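The two-layer generative model the paper works with can be written down in a few lines. This pure-Python sketch is illustrative only: observed gene expression is a weighted sum of hidden TF activities through a sparse factor-loadings matrix, plus noise; factor analysis (Bayesian or rotation-based) runs this model in reverse to recover the loadings and activities.

```python
import random

def expression(loadings, tf_activity, noise_sd=0.0, rng=None):
    """loadings[g][f]: weight (factor loading) of TF f on gene g;
    zero for most entries when the network is sparse."""
    rng = rng or random.Random(0)
    return [sum(w * a for w, a in zip(row, tf_activity))
            + rng.gauss(0.0, noise_sd)
            for row in loadings]

# 4 genes, 2 hidden TFs; each gene regulated by only one TF (sparse loadings)
W = [[1.0, 0.0],
     [2.0, 0.0],
     [0.0, 1.0],
     [0.0, -2.0]]
tf = [0.8, -0.5]
print(expression(W, tf))  # noiseless: [0.8, 1.6, -0.5, 1.0]
```

Sparsity matters for identifiability: when most loadings are zero, rotations of the factor space that mix TFs are penalised, which is why both sparsity priors and matrix rotation methods can recover an interpretable loadings matrix.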

    PROTDES: CHARMM toolbox for computational protein design

    We present open-source software able to automatically mutate any residue positions and find the best amino acids in an arbitrary protein structure without requiring pairwise approximations. Our software, PROTDES, is based on CHARMM and searches automatically for mutations optimizing a protein folding free energy. PROTDES allows the integration of molecular dynamics within the protein design. We have implemented a heuristic optimization algorithm that iteratively searches for the best amino acids and their conformations for an arbitrary set of positions within a structure. Our software allows CHARMM users to perform protein design calculations and to create their own procedures for protein design using their own energy functions. We show this by implementing three different energy functions based on different solvent treatments: surface area accessibility, generalized Born using molecular volume and an effective energy function. PROTDES, a tutorial, parameter sets, configuration tools and examples are freely available at http://soft.synth-bio.org/protdes.html.

    A highly conserved transcriptional repressor controls a large regulon involved in lipid degradation in Mycobacterium smegmatis and Mycobacterium tuberculosis

    The Mycobacterium tuberculosis TetR-type regulator Rv3574 has been implicated in pathogenesis as it is induced in vivo, and genome-wide essentiality studies show it is required for infection. As the gene is highly conserved in the mycobacteria, we deleted the Rv3574 orthologue in Mycobacterium smegmatis (MSMEG_6042) and used real-time quantitative polymerase chain reaction and microarray analyses to show that it represses the transcription both of itself and of a large number of genes involved in lipid metabolism. We identified a conserved motif within its own promoter (TnnAACnnGTTnnA) and showed that it binds as a dimer to 29 bp probes containing the motif. We found 16 and 31 other instances of the motif in intergenic regions of M. tuberculosis and M. smegmatis, respectively. Combining the results of the microarray studies with the motif analyses, we predict that Rv3574 directly controls the expression of 83 genes in M. smegmatis, and 74 in M. tuberculosis. Many of these genes are known to be induced by growth on cholesterol in rhodococci, and palmitate in M. tuberculosis. We conclude that this regulator, designated elsewhere as kstR, controls the expression of genes used for utilizing diverse lipids as energy sources, possibly imported through the mce4 system.
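The reported operator motif, TnnAACnnGTTnnA, maps directly onto a regular expression (with n standing for any base), and a scan of this kind over intergenic regions is how additional instances can be located. A minimal sketch (the scanning code is illustrative; only the motif itself comes from the abstract):

```python
import re

# T, any 2 bases, AAC, any 2, GTT, any 2, A  (14 bp total)
MOTIF = re.compile("T..AAC..GTT..A")

def find_motifs(seq):
    """Return (position, matched 14-mer) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(seq.upper())]

seq = "ggcTGAAACTTGTTGCAccg"
print(find_motifs(seq))  # [(3, 'TGAAACTTGTTGCA')]
```

Note that the motif is a near-palindrome (AAC...GTT), consistent with the abstract's observation that the regulator binds as a dimer, and a production scan would also search the reverse complement.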